This document provides an overview of good research practices for working with data. It was created as a reference for the Ohm lab @RPCCC. The primary reference for this document is the “Reproducible Research Practices” workshop organized by Alex’s Lemonade Stand Foundation’s Childhood Cancer Data Lab on May 14-15, 2024. All other references are included at the end of this document.


What is reproducibility? Why does it matter?

  • Reproducibility means different things depending on the context. For us, in the context of bioinformatics and computational oncology, reproducibility means providing enough information (data, code, methods, etc.) to allow someone else to repeat the exact analysis and arrive at the same results.
Reproducibility crisis:

According to a survey of ~1,500 researchers published in Nature in 2016, >70% of researchers have failed to reproduce someone else’s work and >50% have failed to reproduce their own work.

  • Scientific papers often don’t have enough information to allow for reproducibility of the results.
  • This decreases the reliability of published literature and causes distrust.
Cartoon “Scratch” from www.phdcomics.com

Reproducibility = Obtaining the same results when using the same code, data and conditions of analysis.

  • Reproducibility doesn’t always mean that the analysis/result is correct! It only ensures that the research is transparent. It is the minimum requirement for good data science, BUT this alone is not sufficient.

Barriers to reproducible research
  1. Bias -

    • Publication bias: journals favor positive or novel results, so negative results and failed replications are under-reported.

    • Information bias: systematic errors in how data are measured, collected or recorded.

Replicability = Obtaining similar results across studies using different data.

  • A replicable study can show that the original study was reliable.

  • Together, reproducibility and replicability enhance the reliability of results: they act as a quality check and reduce bias.

Project organization

Sources:
+ Vince Buffalo, Bioinformatics Data Skills: https://www.oreilly.com/library/view/bioinformatics-data-skills/9781449367480/
+ Jenny Bryan: https://speakerdeck.com/jennybc/how-to-name-files
+ Danielle Navarro: https://slides.djnavarro.net/project-structure

Why is organization important?

It makes things easier to find: defining a standard makes the organization of a project predictable.

Best practices when organizing a project:

  • Use plenty of folders and subdirectories rather than piling files into one place.
  • Keep projects separate, and keep sections within a project separate.

Essential folders/directories for any project

  • README, Data, Analysis/Code/Scripts, Figures, Results
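As a sketch (the folder names below are just one convention; adapt them to the lab's standard), this skeleton can be created in a single shell command:

```shell
# Create a minimal project skeleton with the essential folders
mkdir -p my_project/data/raw my_project/data/processed \
         my_project/scripts my_project/figures my_project/results
touch my_project/README.md   # start the README on day one
ls my_project
```

Running this once at the start of every project keeps the layout predictable across the lab.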
README.md
  • README files contain the essential information about a project that another user should know.
  • This is the first file anyone will read when trying to understand your project.
  • What a README file should include:
    • Project title - make it informative.
    • Project summary - including information on any specific methods/techniques used.
    • Project organization - e.g., if files need to be read in a specific order or format.
  • Other things that could be included:
    • More details! - always appreciated. Anything that might make it easier for someone to read and use your project.
    • Future directions?
    • Any specific challenges faced?
Data Folder
  • Only big files go here! These are the files that are used repeatedly.
  • Raw -> usually from external sources. Try not to modify it; to avoid accidental modifications, you can make the files read-only.
  • Keep separate subfolders for processed files.
  • Use subdirectories organized by processing stage, date or sample.

It is often easiest to process all the files in a folder together, so organize by units of work.
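One way to make raw files "unmodifiable", as suggested above, is to remove their write permission (a shell sketch assuming a POSIX system; the file name is hypothetical):

```shell
# Protect a raw data file from accidental modification
mkdir -p data/raw
printf 'sample_id,reads\n' > data/raw/run1_counts.csv   # stand-in for a real raw file
chmod -w data/raw/run1_counts.csv                       # remove write permission for everyone
ls -l data/raw/run1_counts.csv                          # permissions now read-only (-r--...)
```

Any later attempt to overwrite the file will fail, which is exactly the point; use `chmod +w` deliberately if a correction is ever required.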

NAMING FILES AND FOLDERS

Pick a file naming convention. It doesn’t have to be the same as mine but make sure it meets some basic criteria listed below!

Make it informative. You should know what a file contains without opening it (preferably!).

Jenny Bryan’s standard:

Machine friendly:
  1. Avoid spaces
    - Use underscores or dashes
RRBS_data_analysis.R  

instead of

RRBS data analysis.R   
  2. Use standard characters only!
    - letters, numbers, underscores and dashes.
    - Use periods only for file extensions.
    - Avoid using special characters.
Differential_methylation.R  

instead of

Differential.methylation.R      
  3. Be consistent with case!
    - Don’t have two files with the same name that differ only in case.
    - Case may or may not carry meaning, depending on the programming language and OS.

  4. Globbing
    - Build file names from multiple chunks, separated by underscores.
    - This allows the use of wildcards (* and others) to select all files that start or end with specific characters.
    - Globbing lets you select files that match a “pattern”.

Globbing
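A quick shell sketch of the chunks-plus-wildcards idea (the sample names are hypothetical):

```shell
# Chunked file names make wildcard selection easy
touch sampleA_tumor_rrbs.fastq sampleB_tumor_rrbs.fastq sampleC_normal_rrbs.fastq
ls *_tumor_*      # selects only the tumor samples
ls sampleC_*      # selects everything belonging to sample C
```

The same patterns work anywhere globbing is supported, e.g. `list.files(pattern = ...)` in R or as arguments to command-line tools.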
  • What’s even better than globbing? - regex, or regular expressions.
    • Regex allows you to extract information from file names.
    • File names can be easily parsed into a dataframe using R or any other programming language.
    • This makes things very easy for you and others!
Example of regex file naming
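As a sketch of the idea (the file names and field labels are hypothetical), a regex can pull the chunks of a structured file name apart, here with `sed` in the shell:

```shell
# Extract sample, tissue and assay fields from chunked file names with a regex
for f in sampleA_tumor_rrbs.fastq sampleB_normal_rrbs.fastq; do
  echo "$f" | sed -E 's/^(.+)_(.+)_(.+)\.fastq$/sample=\1 tissue=\2 assay=\3/'
done
```

The same capture groups could feed `tidyr::extract()` or `stringr::str_match()` in R to build a sample metadata dataframe directly from the file listing.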
Human friendly:
  1. Use long, descriptive names (short names can be tempting) so you can tell what each file is without opening it!

Don’t:

Analysis1.sh  
Analysis2.sh  

Do:

Alignment_Human_Genome.sh    
Alignment_Mouse_Genome.sh  
Sortable:
  1. Use numbers for sorting. Use 01, 02 (left pad).
  2. Use ISO 8601 dates (year-month-day, i.e. YYYY-MM-DD), which sort correctly.
01.mouse_adapter-trimming.sh
02.human_adapter-trimming.R

#Don't:

10.trimmed.txt 
1.stuff-a.csv
2.stuff-b.csv

Also:

2024-03-01_plasmid_sequence.txt
2024-04-27_cell-line_sequence.txt

#Don't

01-15-2024-backup.csv
22-05-2024_foo.R
Computable:
  1. For files received from someone else:
    - It is usually not advisable to rename them, because the original names are easier to track in conversation.
    - When renaming: use scripts to record the changes made. Only rename when absolutely necessary!
Let’s not be this!
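If renaming really is unavoidable, a sketch of a script that records every change (the file names here are hypothetical):

```shell
# Rename an external file with a script so the old -> new mapping is recorded
touch "report final v2.txt"                        # awkward name from a collaborator
old="report final v2.txt"
new="2024-05-14_collaborator-report.txt"
mv "$old" "$new"
echo "$old -> $new" >> rename_log.txt              # permanent record of the rename
```

Keeping `rename_log.txt` under version control alongside the data means the original names stay traceable in conversation and in the analysis.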

Strategies for Data Sharing

  • data = all the information needed to remake, interpret or use any figure, table, etc.: code (versions, the specific commands run, code for preprocessing, postprocessing and visualization), metadata (about the sequencing and all the samples), documentation, and data (raw and processed).
  • Coordinate with code and maintain records of all scripts, even if the analysis is done by a core facility.
  • Maintain both raw and processed data and deposit them in repositories.
  • Be mindful of patient privacy when working with human subjects’ data.
  • Documentation: contextual information - a description of the data, what each column name means, what NA means, the origin of the data, and the software used to prepare it.
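A minimal example of such documentation (a hypothetical data dictionary for a processed counts table; every field name here is illustrative):

```
# data/processed/README.txt (hypothetical example)
# Source: RRBS run, processed with the scripts in scripts/ (see sessionInfo in results/)
sample_id : unique sample identifier; matches the sample metadata sheet
meth_pct  : percent CpG methylation, 0-100
group     : "tumor" or "normal"
NA        : value could not be measured for that sample
```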

Good documentation increases reliability.

  • Use a data repository: unique identifiers that can be cited, policies for long-term retention, data-sharing policies.
  • Choosing repositories - generalist or specialist, raw or processed data.
  • Complementary sharing avenues - lab/university server, GitHub, Zenodo (used with GitHub).
  • Considerations:
    • Accessible - how will others obtain the data (open vs. controlled access, especially for human subjects’ data)?
    • Interoperable - plain-text files >>>> special formats (Excel, Matlab); use ontologies to obtain a shared vocabulary (a hierarchical graph). E.g., “T-cells” is not descriptive enough - use a specific ontology term (the specific type of T-cell).
    • Reusable - include enough metadata and licensing information that others can reuse the data.

Make a plan/lab-wide policy and stick to it! Decide how you will name, organize and share data files, and how you will store/back up data.

Organizing code in scripts and notebooks

  • Read and use style guides: https://style.tidyverse.org/

  • Important style points to think about:

    • Variable and function names

    • Indentation and spacing - tabs vs. spaces, spaces before and after brackets and =

data = read.csv("/path/to/file.csv")   # readable: spaces around =
data=read.csv("/path/to/file.csv")     # harder to read: no spaces
    • Commenting style
  • Load all the packages up front at the top of the script instead of having random chunks of code with `library()`
  • Comments are for the future you and collaborators. Explain why you are doing what you are doing!

When updating code, remember to update the comments too.

  • Set-up and use R projects whenever applicable!

Managing packages and environments

  • Changes occur at all levels - scripts, packages, individual programs, OS, hardware!
  • For the analysis layer - we are using Git and GitHub. Use “tags” and “releases” on GitHub to mark the exact code used for a result.
  • For the package layer - changes in versions can affect the analysis, and dependencies may require specific versions of packages. Use sessionInfo() in R, or sessioninfo::session_info(), to record versions.
  • Using the same versions of the packages:
    • The renv R package! - tracks, freezes and shares R environments.
    • Each project can have its own environment with its own set of packages.
    • renv doesn’t use the system R package library - it creates a library for each project, but in an optimized way (packages are cached and shared across projects).
    • renv creates a renv.lock file that describes the library.
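The "tags and releases" point above can be sketched with plain git (the repository name, commit message and version label are all hypothetical):

```shell
# Tag the exact state of the code used for a result, so it can be recovered later
git init -q demo-analysis
cd demo-analysis
git config user.email "you@example.com"   # identity needed for the commit
git config user.name "Demo User"
echo 'library(renv)' > analysis.R
git add analysis.R
git commit -q -m "Analysis used for the manuscript"
git tag -a v1.0 -m "Version submitted with the manuscript"
git tag                                    # lists the tag
cd ..
```

On GitHub, pushing a tag (`git push origin v1.0`) is what a "release" is built from, so `git checkout v1.0` always recovers the exact code behind a figure.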

Other references:
1. “Reproducibility and Replicability in Science.” - https://www.ncbi.nlm.nih.gov/books/NBK547546/#

GitHub cheat sheet

Git
Git cheat sheet